This 'Introduction to R for Data Science' is brought to you by the Centre for the Analysis of Genome Evolution & Function's (CAGEF) bioinformatics training initiative. This course, CSB1020, was developed based on feedback on the needs and interests of the Department of Cell & Systems Biology and the Department of Ecology and Evolutionary Biology.
This lesson is the fifth in a 7-part series. The idea is that at the end of the series, you will be able to import and manipulate your data, make exploratory plots, perform some basic statistical tests, test a regression model, and make some even prettier plots and documents to share your results.
So far we have discussed the tidyverse and its tools. You've learned a lot of the "verbs" needed to slice, format, and tidy your data. You can now convert from wide to long-format data and we've taken a side-journey into visualizing your data. Now we'll revisit dataset manipulation through text manipulation and regular expressions.
The structure of the class is a code-along style: it is fully hands-on. Prior to each lecture, the materials will be emailed to you and will also be available for download at QUERCUS, so you can spend more time coding than taking notes.
At the end of this session you will be able to use tidyverse tools and regular expressions to tidy/clean your data.
Today we are going to be learning data cleaning and string manipulation. This is really the battleground of coding: getting your data into a format where you can analyse it. In the next lesson we will learn how to do t-tests and perform regression and modeling in R.
grey background - a package, function, code, command or directory. Backticks are also used for in-line code.
italics - an important term or concept or an individual file or folder
bold - heading or a term that is being defined
blue text - named or unnamed hyperlink
We have 2 data files:
This is an example file for us to start playing with the idea of regular expressions.
This is the main file that we'll be working with for the rest of the lecture. We'll search, replace, and manipulate data from this file after importing it into our notebooks.
The following packages are used in this lesson:
tidyverse (ggplot2, tidyr, dplyr, stringr)
These packages should already be installed in your Anaconda base environment from previous lectures and should be readily available in JupyterHub. If not, please review that lesson, then install and load these packages. Remember to install them from the conda-forge channel of Anaconda.
# Load up our libraries
library(tidyverse)
Why do we need to do this?
'Raw' data is seldom (never) in a usable format. Data in tutorials or demos have already been meticulously filtered, transformed and readied to showcase that specific analysis. How many people have done a tutorial only to find they can't get their own data in the format to use the tool they have just spent an hour learning about?
Data cleaning requires us to:
Some definitions might take this a bit further and include normalizing data and removing outliers. In this course, we consider data cleaning as getting data into a format where we can start actively exploring our data with graphics, data normalization, etc.
Today we are going to mostly focus on the data cleaning of text. This step is crucial for taking control of your dataset and your metadata. I have included the functions I find most useful for these tasks but I encourage you to take a look at the Strings Chapter in R for Data Science for an exhaustive list of functions. We have learned how to transform data into a tidy format in lectures 2 and 3, but the prelude to transforming data is doing the grunt work of data cleaning. So let's get to it!
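To give a flavour of what cleaning text looks like, here is a minimal sketch using a few stringr functions together with a regular expression. The `raw_labels` vector is a made-up example for illustration and is not taken from the lesson's data files; the naming pattern it is checked against is likewise an assumption.

```r
library(stringr)

# Hypothetical messy sample labels, invented for illustration only
raw_labels <- c("  Sample_01 ", "sample-02", "SAMPLE 03")

# Trim surrounding whitespace, then convert everything to lower case
clean <- str_to_lower(str_trim(raw_labels))

# Replace any run of underscores, hyphens, or whitespace with a single underscore
clean <- str_replace_all(clean, "[_\\-\\s]+", "_")

# Check that every label now looks like "sample_" followed by two digits
all_match <- all(str_detect(clean, "^sample_\\d{2}$"))

print(clean)
print(all_match)
```

The three steps (trim, standardize case, standardize separators) are done one at a time on purpose, so you can inspect the intermediate result after each one and see which pattern did what.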